HMA colony analysis
Colony data only
This report is only on the colony data (RNA-seq + TEM-seq) analysis for MV-4-11.
library(knitr)
This is based directly on the annotation (GTF) file used The output of TEtranscripts is ensembl ID and counts. Therefore, this table provides a way to update the output with the information already contained within the gene GTF file (annotation file) used in the TEtranscripts run. This code was based on https://www.biostars.org/p/140471/
N.B. These should be the same gene GTF files used in the TEtranscripts step!
library(rtracklayer)
library(dplyr)
library(readr)
# Version 1 (gencode)
GTF_file <- "./scripts/annos/GTF_files/genes/gencode.v30.annotation_MT_noChr_corrected.gtf" #v2 was using gencode.v30.annotation_NoChr.gtf (only minor update - 'MT' name)
newGTF <- import(GTF_file)
newGTF2 <- as.data.frame(newGTF)
# Extract only genes and export standardised table (used for downstream analysis)
newGTF2 <- newGTF2 %>%
filter(type == "gene") %>%
dplyr::select(Geneid = gene_id, GeneSymbol = gene_name,
Chromosome = seqnames, Start = start, End = end,
Class = gene_type, Strand = strand, Length = width) %>%
mutate(Chromosome = gsub("chr", "", Chromosome)) ## Removes 'chr' in case GTF file was from ensembl (and not already removed)
write_tsv(newGTF2, file = "./scripts/annos/gencode.v30_gene_annotation_table3.txt")
This chunk takes a while due to the slow import of the merged count tables (particularly for larger files like the colony dataset).
source("./scripts/TEtranscripts_Update_and_standardise_output_table.R")
## MV colony update
update_table(data = "./1_data/colony/MV/TEtranscripts/TEtranscripts_Merged.cntTable",
anno_file = "./scripts/annos/gencode.v30_gene_annotation_table3.txt",
output.dir = "./2_results/colony/MV",
output.appendix = "")
This is currently only written to work with the colony data, to produce a saved sce object (stored in an R object), which also contains associated metadata for genes/ TEs and samples.
Add Semqonk QC into sce (if available). This will update sce object metadata, and the meta data tsv file (located in the output.directory)
#Don't need to run this if it was already run in the QC index. Provided here for reference (or if it needs to be 're-run')
source("./scripts/QC_seqmonk_sample_filtering_function.R")
Sample_QC_filter(QC.file = "./2_results/colony/QC/QC_stats_seqmonk_hisat2_summarised_ALL_colonies.csv",
cell = "MV",
overall.alignment.rate.threshold = 0,
Percent.Genes.Measured.threshold = 35,
Percent.in_exons.threshold = 65,
output.folder = "./2_results/colony/MV/QC")
source("./scripts/create_sce_add_seqmonk_QC.R")
create_sce_add_seqmonk_qc(
input.sce = "./2_results/colony/MV/sce.rds",
seqmonk.QC.passing = "./2_results/colony/MV/QC/Samples_passing_QC.txt",
output.directory = "./2_results/colony/MV")
This is to run the QC and filtering script:
# This includes initial QC and secondary filtered QC
source("./scripts/Perform_sce_QC_and_filter-seqmonk_based.R")
perform_QC_seqmonk_based(input.file = "2_results/colony/MV/sce.rds",
output.file.dir = '2_results/colony/MV',
output.fig = 'figures/colony/MV/QC',
min.counts = 5,
min.cells = 3)
The initial QC is utilising the scran::scater() package.
There is clear difference in library size between the three packages, with rep 2 and 3, showing a minor secondary peak with a low library size. | This is further highlighted in the second plot, with both rep 2 and 3 having several cells with few ‘features detected’.
On the PCA plot, a third group separated on PC2 appears to be driven by the ‘number of unique gene features’, with all cells in this separate cluster having the lowest number of genes. Therefore, these cells with be filtered out.
Samples are filtered based on the results from the seqmonk QC (see colony_QC markdown) that produced a list of samples passing QC. Then genes/TEs are filter for above thresholds to remove missing/lowly expressed genes, by requiring x amount of reads in at least Y cells/samples.
scater can also compute the marginal R2 for each variable when fitting a linear model regressing expression values for each gene against just that variable, and display a density plot of the gene-wise marginal R2 values for the variables. | Each line corresponds to one factor and represents the distribution of R2 values across all genes. The factors being regressed/ plotted are ranked, so the ‘top’ factor in the legend is the one that explains ‘the most’ amount of variance in the dataset
Here, we can see ‘batch’ accounts for more variance in expression than treatment in both the initial QC and post filtering. After filtering, number of genes and TEs detected and batch still explain the highest amounts of variance. The subset of mitochondria explains the least amount of variance across this data. Therefore, batch correction needs to applied.
First exploring how the data looks within each individual batch. This will help assess the potential challenges and success of batch correction.
This is using the ‘raw’ data (pre-filtered)
(Running from NOMeSeq directory.)
source("./scripts/converted_from_HPC/per_batch_normalisation_v3.R")
per_batch_normalisation(input.file = "./2_results/colony/MV/sce_qc1.rds",
output.file = "./2_results/colony/MV/Batch_Correction/sce_qc1_UniBatchNorm.rds",
output.fig.dir = 'figures/colony/MV/Batch_Correction/',
per.batch = FALSE # If false, it performs scuttle:logNormalise() for the combined batches but still performs runPCA on the batches separately after.
)
source("./scripts/converted_from_HPC/viz_per_batch_normalisation.R")
viz_per_batch_normalisation(output.file = './2_results/colony/MV/Batch_Correction/sce_qc1_UniBatchNorm.rds', # output RDS path to save the data
output.fig = 'figures/colony/MV/Batch_Correction/', # output folder for figures
output.fig.name = "QC1_Norm_PCA_PerBatch.png", # output name for figures
reducedDim = "PCA.batch",
shape = 1,
size = 3,
width = 8,
height = 4,
all_cols = c("unt" = "green", "aza" = "blue", "dac" = "red") # Change as required for your batch c("rep_name1" = "colour1, "rep_name2 = "colour2")
)
This is using the post-filtered data.
(Running from NOMeSeq directory.)
source("./scripts/converted_from_HPC/per_batch_normalisation_v3.R")
per_batch_normalisation(input.file = "./2_results/colony/MV/sce_qc2.rds",
output.file = "./2_results/colony/MV/Batch_Correction/sce_qc2_UniBatchNorm.rds",
output.fig.dir = 'figures/colony/MV/Batch_Correction/',
per.batch = FALSE # If false, it performs scuttle:logNormalise() for the combined batches but still performs runPCA on the batches separately after.
)
source("./scripts/converted_from_HPC/viz_per_batch_normalisation.R")
viz_per_batch_normalisation(output.file = './2_results/colony/MV/Batch_Correction/sce_qc2_UniBatchNorm.rds', # output RDS path to save the data
output.fig = 'figures/colony/MV/Batch_Correction/', # output folder for figures
output.fig.name = "QC2_Norm_PCA_PerBatch.png", # output name for figures
reducedDim = "PCA.batch",
shape = 1,
size = 3,
width = 8,
height = 4,
all_cols = c("unt" = "green", "aza" = "blue", "dac" = "red") # Change as required for your batch c("rep_name1" = "colour1, "rep_name2 = "colour2")
)
include_graphics('./figures/colony/MV/Batch_Correction/QC1_Norm_PCA_PerBatch.png')
scRNAseq PCA on each batch separately
include_graphics('./figures/colony/MV/Batch_Correction/QC2_Norm_PCA_PerBatch.png')
scRNAseq PCA on each batch separately
MNN tries
to identify pairs of mutual nearest neighbors (MNN) in the
high-dimensional log-expression space. Each MNN pair represents cells in
different batches that are of the same cell type/state, assuming that
batch effects are mostly orthogonal to the biological manifold.
Correction vectors are calculated from the pairs of MNNs and corrected
(log-)expression values are returned
Apply MNN on logcounts to correct for batch effects using all of the genes. Add the corrected logcounts to the SCE. Create and save PCA plots coloured by treatment and batch.
source("./scripts/converted_from_HPC/batch_correction_MNN.R")
batch_correction_mnn(input.file = "./2_results/colony/MV/Batch_Correction/sce_qc2_UniBatchNorm.rds",
output.file = "./2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect.rds",
assay.name = "mnnCorrected",
nhvgs = 5000,
force = FALSE)
source("./scripts/converted_from_HPC/viz_batch_normalisation.R")
viz_batch_normalisation(output.file = "./2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect.rds", # output RDS path to save the data
output.fig = 'figures/colony/MV/Batch_Correction/', # output folder for figures
output.fig.name = "PCA_mnnCorrected.png", # output name for figures
shape = "Batch",
size = 3,
width = 12,
height = 8,
all_cols = c("unt" = "green", "aza" = "blue", "dac" = "red"), # Change as required for your batch c("rep_name1" = "colour1, "rep_name2 = "colour2")
reducedDim = 'PCA_mnnCorrected')
source("./scripts/converted_from_HPC/viz_variance_explained.R")
viz_variance_explained(sce.rds = c("./2_results/colony/MV/Batch_Correction/sce_qc1_UniBatchNorm.rds",
"./2_results/colony/MV/Batch_Correction/sce_qc2_UniBatchNorm.rds",
"./2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect.rds",
"./2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect.rds"),
exprs.values.name = c("counts","counts", "logcounts", "mnnCorrected"), # Needs to be the same length as the number of sce.rds files.
plot.names = c("Raw data (QC1)", "QC filtered (QC2)", "QC2 + Norm", "MNN corrected"),
variables.to.measure = c("total", "Batch", "Treatment", "subsets_Gene_detected", "subsets_TE_detected", "subsets_Mito_detected"),
output.file.dir = "figures/colony/MV/Batch_Correction",
output.file.name = "ExplanatoryVariables-mnncorrected.png")
include_graphics('./figures/colony/MV/Batch_Correction/PCA_mnnCorrected.png')
include_graphics('./figures/colony/MV/QC/colony_QC2-PCA.png')
include_graphics('./figures/colony/MV/Batch_Correction/ExplanatoryVariables-mnnCorrected.png')
Plotting the methylation levels of colonies passing QC to identify an appropriate level/cut-off for ‘methylation retaining’ treated colonies (high methylation) relative to untreated colonies.
Using a cut-off at 75% for ‘methylation retaining’ seems appropriate when comparing batches independently and combined.
include_graphics('./figures/colony/MV/Colony_methylation_levels_by_batches_and_Tx.png')
include_graphics('./figures/colony/MV/Colony_methylation_levels_combined_batches.png')
Using mnnCorrected data which is the ‘most consistent’ across all three cell lines… (Albeit not optimal for each). The following steps are performed:
Here, the top 2000 highly variable genes (HVGs) are identified and plotted by PCA displaying treatment status, high vs low methylation and batch.
source("./scripts/HVGs_identify_cluster_heatmap.R")
HVGs_select_and_PCA(sce.rds.file ="2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect_scMerged_2C.rds",
chr.to.keep = c("1", "2", "3", "4", "5", "6","7","8", "9","10","11", "12","13","14", "15","16","17", "18","19","20","21", "22", NA), # NA represents TEs
Keep.genes.TEs <- c("gene","TE"), # Enables HVG selection to filter for either just gene or TE, or both N.B. Make sure 'chr.to.keep' doesn't filter out TEs if you want to keep them.
output.fig.dir = "./figures/colony/MV/HVGs/",
output.file.dir = "./2_results/colony/MV/HVGs/",
output.file.name = 'HVGs_2000.txt',
batch.correction.method = "mnnCorrected",
PCA.methylation.threshold = 75,
Number.of.genes = 2000,
size = 3,
width = 6,
height = 4,
alpha = 0.6,
svg = TRUE,
all_cols = c("unt" = "green", "aza" = "blue", "dac" = "red",
"rep1" = "orange", "rep2" = "violet", "rep3" = "brown"),
set.the.seed = TRUE)
Cell.type <- "MV"
include_graphics(paste0('./figures/colony/',Cell.type,'/HVGs/PCA_mnnCorrected_Top_2000_HVGs-ByBatch.png'))
Cell.type <- "MV"
include_graphics(paste0('./figures/colony/',Cell.type,'/HVGs/PCA_mnnCorrected_Top_2000_HVGs-ByTreatment.png'))
source("./scripts/HVGs_identify_cluster_heatmap.R")
Heatmap_from_list(sce.rds.file ="2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect_scMerged_2C.rds",
HVG.table = "./figures/colony/MV/HVGs/HVGs_2000.txt",
methylation.threshold = 75,
kmeans.clusters = 8,
kmeans.iter = 50,
kmeans.nstart = 30,
output.fig.dir = "./figures/colony/MV/HVGs/",
output.file.dir = "./2_results/colony/MV/HVGs/",
batch.correction.method = "mnnCorrected",
plot.Heatmap.Row.Normalisations = c("Zscored"), # Can be restricted to just one or two options. c("Zscored", "Mean_Centred", "Raw")
plot.Pheatmap.style = FALSE,
plot.Complex.Heatmap.NoRowCluster.style = TRUE,
plot.Complex.Heatmap.WithRowCluster.style = TRUE,
set.the.seed = TRUE)
Cell.type <- "MV"
Row.normalisation <- "Zscored"
include_graphics(paste0('./figures/colony/',Cell.type,'/HVGs/',Row.normalisation,'/Complex.Heatmap_HVGs_Kmeans_within_clusters_Zscored.png'))
All of these analyses are using the ClusterProfiler package. The
functions used here are wrappers for:
* clusterProfiler::compareCluster() # For GO
* clusterProfiler::enrichGO() # For GO individually looping
clusters.
* clusterProfiler::enricher() # MsigDB
* clusterProfiler::enrichKEGG() # KEGG
source("./scripts/ClusterProfiler_ORA_functions.R")
Perform_ORA_GO(Cluster.run.file = "./figures/colony/MV/HVGs/Zscored/Kmeans_cluster_list_Zscored.txt",
cluster.number.s = c(7,1,4,8,3,6,2,5), # This will also determine the order of the groups in the plots
sce.rds.file ="2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect_scMerged_2C.rds", # For background
batch_correction_method = "mnnCorrected",
chr.to.keep = c("1", "2", "3", "4", "5", "6","7","8", "9","10","11", "12","13","14", "15","16","17", "18","19","20","21", "22", NA), # NA represents TEs
Keep.genes.TEs = c("gene"),
Compare.Cluster = TRUE,
categories.to.test = c("BP"),
p.adj.method = "fdr",
p.adj.threshold = 0.05,
q.val.threshold = 0.4,
plot.key.colour = "p.adjust",
showCategory.number = 3,
output.fig.dir = "./figures/colony/MV/HVGs/Zscored/ORA/",
output.file.dir = "./2_results/colony/MV/HVGs/Zscored/ORA/",
output.extension.name = "",
add.plots = FALSE,
plot.dims.manual = TRUE,
plotx.height.png = 22, # Adjust as required (some defaults provided in the script)
plotx.width.png = 12,
plotx.height.svg = 22,
plotx.width.svg = 12,
ploty.height.png = 17,
ploty.width.png = 25,
ploty.height.svg = 14,
ploty.width.svg = 26)
include_graphics(paste0('./figures/colony/MV/HVGs/Zscored/ORA/Kmeans_cluster_list_Zscored_compareCluster_GO-BP_ploty_top3.png'))
No plots. Just table of results
source("./scripts/ClusterProfiler_ORA_functions.R")
Perform_ORA_GO(Cluster.run.file = "./figures/colony/MV/HVGs/Zscored/Kmeans_cluster_list_Zscored.txt",
cluster.number.s = c(1:8),
sce.rds.file ="2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect_scMerged_2C.rds", # For background
batch_correction_method = "mnnCorrected",
chr.to.keep = c("1", "2", "3", "4", "5", "6","7","8", "9","10","11", "12","13","14", "15","16","17", "18","19","20","21", "22", NA), # NA represents TEs
Keep.genes.TEs = c("gene"),
Compare.Cluster = FALSE,
categories.to.test = c("BP","CC", "MF"),
p.adj.method = "fdr",
p.adj.threshold = 0.05,
q.val.threshold = 1,
plot.key.colour = "p.adjust",
showCategory.number = NA,
output.fig.dir = "./figures/colony/MV/HVGs/Zscored/ORA/",
output.file.dir = "./2_results/colony/MV/HVGs/Zscored/ORA/",
output.extension.name = "",
add.plots = FALSE)
No plots. Just table of results
http://yulab-smu.top/biomedical-knowledge-mining-book/clusterprofiler-kegg.html
source("./scripts/ClusterProfiler_ORA_functions.R")
Perform_ORA_KEGG(Cluster.run.file = "./figures/colony/MV/HVGs/Zscored/Kmeans_cluster_list_Zscored.txt",
cluster.number.s = c(1:8),
sce.rds.file ="2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect_scMerged_2C.rds", # For background
batch_correction_method = "mnnCorrected",
p.adj.method = "fdr",
p.adj.threshold = 0.05,
q.val.threshold = 1,
plot.key.colour = "p.adjust",
output.fig.dir = "./figures/colony/MV/HVGs/Zscored/ORA/",
output.file.dir = "./2_results/colony/MV/HVGs/Zscored/ORA/",
output.extension.name = "",
chr.to.keep = c("1", "2", "3", "4", "5", "6","7","8", "9","10","11", "12","13","14", "15","16","17", "18","19","20","21", "22", NA), # NA represents TEs
Keep.genes.TEs = c("gene"))
No plots. Just table of results
source("./scripts/ClusterProfiler_ORA_functions.R")
Perform_ORA_MSigDB(Cluster.run.file = "./figures/colony/MV/HVGs/Zscored/Kmeans_cluster_list_Zscored.txt",
cluster.number.s = c(1:8),
sce.rds.file ="2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect_scMerged_2C.rds", # For background
batch_correction_method = "mnnCorrected",
chr.to.keep = c("1", "2", "3", "4", "5", "6","7","8", "9","10","11", "12","13","14", "15","16","17", "18","19","20","21", "22", NA), # NA represents TEs
Keep.genes.TEs = c("gene"),
p.adj.method = "fdr",
msigdb.categories.to.test = c("H","C1","C2","C3","C4","C5","C6"),
p.adj.threshold = 0.05,
q.val.threshold = 1,
plot.key.colour = "p.adjust",
output.fig.dir = "./figures/colony/MV/HVGs/Zscored/ORA/",
output.file.dir = "./2_results/colony/MV/HVGs/Zscored/ORA/",
output.extension.name = "",
add.plots = FALSE)
This is being run in DAC only, as they were the only treatment group with a high number of methylating retaining colonies and we were particularly interested to see what genes might be differentially regulated in these (methlyation retaining) colonies.
Steps: i) Filter for only dac cells. ii) Correlate expression of all genes and TEs against global methylation levels. iii) Add multiple testing correction and gene info. iv) Plot basic correlation summary
source("./scripts/Correlate_gene_vs_methylation.R")
correlate_expr_methyl(sce.rds.file ="2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect_scMerged_2C.rds",
sample.sheet = "1_data/colony/MV/Sample_analysis_sheet.csv",
treatments = "dac",
output.fig.dir = "./figures/colony/MV/Correlated_Expression/",
output.file.dir = "./2_results/colony/MV/Correlated_Expression/",
batch.correction.method = "mnnCorrected")
Histogram of correlation estimate values
include_graphics('./figures/colony/MV/Correlated_Expression/Expr_Meth_Corr_est_hist_dac.png')
Correlation estimate vs padj values
include_graphics('./figures/colony/MV/Correlated_Expression/Expr_Meth_Corr_est_vs_pval_dac.png')
No Gene Ontology Over-Representation Analysis was performed as there were too few ‘significantly correlated genes’ that were identified within this dataset and subsequently used for the Kmeans clustering…
source("./scripts/Go_analysis_and_plotting_wrapper.R")
Run.Number <- 2
Cluster.Number <- 1
Perform_GO_and_plot_from_cluster_results(Cluster.run = paste0("./2_results/colony/MV/Correlated_Expression/Methylation_Exp_mnnCorrected_correlation_and_heatmap_cluster_Run",Run.Number,".csv"),
cluster.number = Cluster.Number,
sce_used ="2_results/colony/MV/Batch_Correction/sce_qc2_norm_mnnCorrect_scMerged_2C.rds", # For background
batch_correction_method = "mnnCorrected",
p.adj.method = "fdr",
p.adj.threshold = 0.05,
plot.key.colour = "p.adjust",
output.fig.dir = "./figures/colony/MV/Correlated_Expression/Meth_low_downreg/",
output.file.dir = "./2_results/colony/MV/Correlated_Expression/Meth_low_downreg/",
output.extension.name = paste0("Run",Run.Number,"C",Cluster.Number) ) # automatically creates extension with 'RunXCY'